feat(seccomp): ExtraHandler — user-supplied syscall handlers#20
feat(seccomp): ExtraHandler — user-supplied syscall handlers#20dzerik wants to merge 5 commits intomultikernel:mainfrom
Conversation
5f2b730 to
71c5724
Compare
|
Thanks for the PR! Two main issues:
|
1d0783d to
431c207
Compare
|
You were right on both counts — the missing BPF plumbing was the 1. BPF plumbing
While wiring this up I noticed an adjacent footgun: the cBPF program emits 2. Tests that actually exercise dispatchThe unit-level
Cosmetic
Deliberately deferred
Diff stat: 8 files, ~1046 / -9. All 215 unit tests pass; integration |
|
No test for policy.deny_syscalls ? |
| non-[`NotifAction::Continue`](../crates/sandlock-core/src/seccomp/notif.rs) | ||
| result wins. | ||
|
|
||
| This patch exposes a **public extension point**: |
There was a problem hiding this comment.
Please make it a formal document, it reads like a patch description.
There was a problem hiding this comment.
Rewrote as a design doc. Removed PR-narrative phrasing ("this patch", "available since 0.7 (branch ...)", "two concrete use cases motivate this API"); reorganised: API → Semantics (Ordering, Return values, Continue-site safety) → Security boundary (Extras can/cannot, BPF coverage, Deny-list bypass guard) → Panics → Use cases → Limitations → Backwards compatibility → Downstream usage. Present tense throughout; ordering invariants now reference both the unit and integration tests inline.
|
|
||
| // builtin is index 0, extra is index 1 | ||
| let chain = table.chains.get(&libc::SYS_openat).unwrap(); | ||
| assert_eq!(chain.handlers.len(), 2, "two handlers expected"); |
There was a problem hiding this comment.
Right — those tests only counted handlers (chain.handlers.len()); same Vec::push-not-the-contract pattern you flagged the first time, just on a new test. Replaced both with three that drive DispatchTable::dispatch directly against a minimal SupervisorCtx (built from the per-state new()s — no kernel, no fds, handlers ignore ctx):
dispatch_walks_chain_in_registration_order— three handlers all returningContinue, asserts the recorded tag sequence is[1, 2, 3].dispatch_runs_builtin_before_extra— builtin viaregister, extra viaExtraHandler; asserts[B, E]andContinuepropagates.dispatch_stops_at_first_non_continue— first handler returnsErrno(EACCES); asserts the second handler never runs and the errno surfaces.
End-to-end ordering stays exercised by the integration tests (extra_handler_runs_after_builtin_returns_continue, builtin_non_continue_blocks_extra, chain_of_extras_runs_in_insertion_order).
431c207 to
c602c47
Compare
Symmetric to the existing default-deny coverage, exercises the user-specified branch of `deny_syscall_numbers` (when `Policy::deny_syscalls` is set, it overrides DEFAULT_DENY). Without this branch covered, a future regression in user-list resolution would silently let an extra register on a caller-denied syscall and `Continue` would translate to `SECCOMP_USER_NOTIF_FLAG_CONTINUE`, bypassing the kernel deny. Both tests use `SYS_mremap`: it is recognised by `syscall_name_to_nr` but not present in `DEFAULT_DENY_SYSCALLS`, so it lands on the deny list only via the user-supplied branch — isolating that path from the default-deny path covered by `extra_handler_on_default_deny_syscall_is_rejected`. - Unit (`seccomp::dispatch::extra_handler_tests:: validate_extras_rejects_user_specified_deny`): drives `validate_extras_against_policy` directly, no kernel dependency, so the contract is enforced even on hosts where seccomp integration tests are skipped. - Integration (`test_extra_handlers:: extra_handler_on_user_specified_deny_is_rejected`): drives the full `Sandbox::run_with_extra_handlers` rejection path; asserts the offending syscall number is surfaced in the error. Addresses review feedback on PR multikernel#20. Signed-off-by: dzerik <dzerik@gmail.com>
|
Added.
Both use Rebased on
All 227 |
Adds a public extension point for downstream crates that need to
register their own seccomp-notification handlers alongside sandlock's
builtin chroot/cow/procfs/network/port_remap logic.
Motivation: downstream crates that want to intercept additional
syscalls in the same supervisor task as sandlock's builtins have no
clean way to do it today — one SECCOMP_FILTER_FLAG_NEW_LISTENER per
process means a single listener, so a second supervisor cannot run
alongside. The only alternative is forking sandlock or patching
notif::supervisor wholesale.
API:
- New type dispatch::ExtraHandler { syscall_nr, handler }.
- New entry Sandbox::run_with_extra_handlers(policy, cmd, extras).
- Existing Sandbox::run() delegates to it with empty extras — zero
behaviour change for current callers.
Ordering contract (documented + tested):
- Builtins register first (chroot path normalization, COW, procfs, …).
- Extras appended last, in the Vec order.
- Chain stops at first non-Continue — user handlers cannot subvert
builtin confinement.
BPF coverage (this is what plumbs extras to the kernel):
- Sandbox::do_spawn collects the syscall numbers from extra_handlers
and threads them into the child via the new ChildSpawnArgs.extra_syscalls
field on context::confine_child.
- The child merges them into notif_syscalls(policy) before
bpf::assemble_filter, with sort + dedup so a syscall registered both
by a builtin and an extra produces a single JEQ.
- Without this step the kernel would never raise USER_NOTIF for a
syscall that has no builtin handler — the dispatch table would
receive nothing and the user handler would silently never fire.
Default-deny bypass guard:
- The cBPF program emits notif JEQs before deny JEQs, so a syscall
present in both lists hits SECCOMP_RET_USER_NOTIF first. An extra
on a DEFAULT_DENY syscall would therefore convert a kernel-deny into
a user-supervised path, and a Continue from the handler would
silently bypass deny.
- Sandbox::run_with_extra_handlers now validates extras against the
policy's deny list at registration time via
dispatch::validate_extras_against_policy and returns
SandboxError::Child naming the offending syscall — no silent footgun.
Internals:
- build_dispatch_table now takes Vec<ExtraHandler> and drains it into
register() calls after builtins.
- notif::supervisor signature extended to accept extras and pass them
through. sandbox.rs moves self.extra_handlers via std::mem::take
on spawn (HandlerFn is Box<dyn Fn> — not Clone).
- confine_child's seven positional parameters packed into
context::ChildSpawnArgs to keep the call site readable.
Docs:
- docs/extension-handlers.md: design rationale, security boundary,
panics policy, non-goals, downstream sketch. Adds §3.0 (BPF-filter
merge semantics) and §3.0.1 (default-deny bypass guard); corrects
the NotifAction variant table (ReturnValue, Kill { sig, pgid }).
- crates/sandlock-core/examples/openat_audit.rs: runnable example.
Tests:
- 4 unit tests on dispatch::extra_handler_tests (ctor, insertion
order, append-after-builtin, empty-extras nop).
- 7 integration tests under tests/integration/test_extra_handlers.rs
exercising the full kernel path:
* extra on SYS_uname (not intercepted by any builtin) returning
Errno(EACCES) reaches the guest;
* Continue lets the kernel resume the syscall;
* empty extras vector preserves baseline behaviour;
* cross-handler ordering: extra on SYS_openat fires after the
/proc-virtualization builtin returns Continue;
* registration on SYS_mount (DEFAULT_DENY) is rejected up-front
with a descriptive error;
* builtin non-Continue blocks extra: openat on /proc/1/cmdline is
rejected by the procfs builtin and is never observed by the
extra (path inspected via process_vm_readv), while a peer
openat on /etc/hostname is observed — proves the chain stops at
first non-Continue end-to-end through the kernel;
* chain of two extras on the same syscall: first returns Continue,
second returns Errno(EACCES) — both counters increment in lock
step (insertion order preserved) and the guest sees the EACCES.
- All 215 unit tests pass; the 178-test integration suite passes
modulo the pre-existing 54-test failure set observed on origin/main
(kernel/capability environment, unrelated to this change).
Minor bump 0.6 → 0.7 suggested.
Signed-off-by: dzerik <dzerik@gmail.com>
Symmetric to the existing default-deny coverage, exercises the user-specified branch of `deny_syscall_numbers` (when `Policy::deny_syscalls` is set, it overrides DEFAULT_DENY). Without this branch covered, a future regression in user-list resolution would silently let an extra register on a caller-denied syscall and `Continue` would translate to `SECCOMP_USER_NOTIF_FLAG_CONTINUE`, bypassing the kernel deny. Both tests use `SYS_mremap`: it is recognised by `syscall_name_to_nr` but not present in `DEFAULT_DENY_SYSCALLS`, so it lands on the deny list only via the user-supplied branch — isolating that path from the default-deny path covered by `extra_handler_on_default_deny_syscall_is_rejected`. - Unit (`seccomp::dispatch::extra_handler_tests:: validate_extras_rejects_user_specified_deny`): drives `validate_extras_against_policy` directly, no kernel dependency, so the contract is enforced even on hosts where seccomp integration tests are skipped. - Integration (`test_extra_handlers:: extra_handler_on_user_specified_deny_is_rejected`): drives the full `Sandbox::run_with_extra_handlers` rejection path; asserts the offending syscall number is surfaced in the error. Addresses review feedback on PR multikernel#20. Signed-off-by: dzerik <dzerik@gmail.com>
The integration tests and extra_handler_ctor_preserves_fields use full parameter names (_notif, _ctx, _fd). The dispatch ordering unit tests inherited the abbreviated form (_n, _c, _f) from earlier iterations of this test mod; align them with the rest for in-mod consistency. No behaviour change — all 228 sandlock-core unit tests and 8 extra_handlers integration tests still pass. Signed-off-by: dzerik <dzerik@gmail.com>
Downstream crates that use Sandbox::run_with_extra_handlers need to inspect and modify guest memory from inside their handlers — read the path argument of openat, copy a write() buffer into a chunked S3 upload, synthesise a fake getdents64 directory listing. The TOCTOU-safe wrappers in seccomp::notif were already correct (id_valid before + after process_vm_readv / process_vm_writev), they were just pub(crate). Promote them to pub. The internal *_vm variants remain private. NotifError and SeccompNotif are already public, so this completes the guest-memory-access API surface for ExtraHandlers. Signed-off-by: dzerik <dzerik@gmail.com>
c602c47 to
9f7d1b2
Compare
|
Apologies — missed the two inline review comments on my previous pass and only addressed the issue-level "No test for policy.deny_syscalls" question. Both inline points addressed now (replied in-thread on each); summary of what landed:
The Latest head: |
The arm64 Ubuntu runner (and aarch64 hosts in general) do not have
`/lib64` — the dynamic linker lives under `/lib/aarch64-linux-gnu/`.
A strict `fs_read("/lib64")` makes Landlock refuse to add the rule
and the child exits before completing confinement, surfacing in the
parent as `pipe closed before 4 bytes read`.
Switch to `fs_read_if_exists("/lib64")` to mirror the convention
already used by `test_dry_run`, `test_fork`, `test_netlink_virt`,
and `test_landlock`. Brings the six previously-failing tests on
the `ubuntu-24.04-arm` CI job back to green; x86_64 unchanged.
Signed-off-by: dzerik <dzerik@gmail.com>
|
Two follow-ups after the previous push: 1. arm64 CI fix. The 2. Latest head: |
|
The FFI (sandlock-ffi) and Python SDK (python/src/sandlock/) are not updated together with Rust? |
|
Really nice security work here — the deny-list bypass guard and the BPF notif-list merge show genuine kernel-level care, and the test suite (especially The let audit: HandlerFn = Box::new(move |notif, _ctx, _fd| {
let counter = Arc::clone(&counter_clone);
Box::pin(async move {
let n = counter.fetch_add(1, Ordering::SeqCst) + 1;
eprintln!("[audit #{n}] pid={} openat", notif.pid);
NotifAction::Continue
})
});Outer A trait-based shape gets there in roughly the same LOC and keeps closures working via a blanket impl: #[async_trait]
pub trait Handler: Send + Sync + 'static {
async fn handle(&self, cx: &HandlerCtx<'_>) -> NotifAction;
}
pub struct HandlerCtx<'a> {
pub notif: SeccompNotif,
pub sup: &'a SupervisorCtx,
pub notif_fd: RawFd,
}
// Closures still work — no Box::new, no Box::pin at the call site.
#[async_trait]
impl<F, Fut> Handler for F
where
F: Fn(&HandlerCtx<'_>) -> Fut + Send + Sync + 'static,
Fut: Future<Output = NotifAction> + Send + 'static,
{ ... }Then the example becomes: struct OpenAudit { count: AtomicUsize }
#[async_trait]
impl Handler for OpenAudit {
async fn handle(&self, cx: &HandlerCtx<'_>) -> NotifAction {
let n = self.count.fetch_add(1, Ordering::SeqCst) + 1;
eprintln!("[audit #{n}] pid={} openat", cx.notif.pid);
NotifAction::Continue
}
}Zero While we're touching this surface, two related cleanups worth folding in:
None of this changes the security work — Timing argument: sandlock is at 0.7 and Happy to sketch the migration as a separate PR on top if useful — the security boundary you've built is the load-bearing part and I don't want this comment to block that landing. |
|
Thanks for both — security approval makes the load-bearing part landable, and the API-shape feedback is exactly the kind of pre-pin tightening that's cheaper now than after a downstream crate ships against Strong preference for the same scope split you suggested: land the security part as-is, peel the rest into a follow-up chain. Concrete proposal, in landing order: 1. #20 (this PR) — 2. Follow-up A — user-facing reshape: 3. Follow-up B — FFI parity ( Reasoning for the ordering: writing FFI against the current Two scope-questions to make sure I'm not missing intent:
|
Summary
Adds a public extension point for downstream crates that need to register their own seccomp-notification handlers alongside sandlock's builtin chroot/cow/procfs/network/port_remap logic.
Motivation. Downstream crates that want to intercept additional syscalls in the same supervisor task as sandlock's builtins have no clean way to do it today — one
SECCOMP_FILTER_FLAG_NEW_LISTENERper process means a single listener, so a second supervisor cannot run alongside. The only alternative is forking sandlock or patchingnotif::supervisorwholesale.API.
dispatch::ExtraHandler { syscall_nr, handler }.Sandbox::run_with_extra_handlers(policy, cmd, extras).Sandbox::run()delegates to it with empty extras — zero behaviour change for current callers.Ordering contract (documented + tested).
Vecorder.Continue— user handlers cannot subvert builtin confinement.Docs.
docs/extension-handlers.md: design rationale, security boundary, panics policy, non-goals, downstream sketch.crates/sandlock-core/examples/openat_audit.rs: runnable example.Minor bump
0.6 → 0.7suggested.Test plan
dispatch::extra_handler_tests(ctor, insertion order, append-after-builtin, empty-extras nop) — passingopenat_audit.rsruns against apython3 -cguest